NSF PAR Search | NSF Public Access Repository

Enabling Unstructured Sparse Fine-Tuning and Inference for Foundation Models on Wafer-Scale Engine

https://doi.org/10.1145/3731599.3767395

Zheng, Haoyu; Zeng, Yifan; Song, Linghao; Emani, Murali; Dong, Wenqian (November 2025, ACM)

Free, publicly-accessible full text available November 15, 2026
Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor

Xie, Zhen; Emani, Murali; Yu, Xiaodong; Tao, Dingwen; He, Xin; Su, Pengfei; Zhou, Keren; Vishwanath, Venkatram (July 2024, 2024 USENIX Annual Technical Conference (USENIX ATC 24))

Full Text Available
Cross-Feature Transfer Learning for Efficient Tensor Program Generation

https://doi.org/10.3390/app14020513

Verma, Gaurav; Raskar, Siddhisanket; Emani, Murali; Chapman, Barbara (January 2024, Applied Sciences)

Tuning tensor program generation involves navigating a vast search space to find optimal program transformations and measurements for a program on the target hardware. The complexity of this process is further amplified by the exponential combinations of transformations, especially in heterogeneous environments. This research addresses these challenges by introducing a novel approach that learns the joint neural network and hardware feature space, facilitating knowledge transfer to new, unseen target hardware. A comprehensive analysis is conducted on the existing state-of-the-art dataset, TenSet, including a thorough examination of test split strategies and the proposal of methodologies for dataset pruning. Leveraging an attention-inspired technique, we tailor the tuning of tensor programs to embed both neural network and hardware-specific features. Notably, our approach substantially reduces the dataset size by up to 53% compared to the baseline without compromising Pairwise Comparison Accuracy (PCA). Furthermore, our proposed methodology demonstrates competitive or improved mean inference times with only 25–40% of the baseline tuning time across various networks and target hardware. The attention-based tuner can effectively utilize schedules learned from previous hardware program measurements to optimize tensor program tuning on previously unseen hardware, achieving a top-5 accuracy exceeding 90%. This research introduces a significant advancement in autotuning tensor program generation, addressing the complexities associated with heterogeneous environments and showcasing promising results regarding efficiency and accuracy.
Full Text Available
Transfer Learning Across Heterogeneous Features For Efficient Tensor Program Generation

https://doi.org/10.1145/3587278.3595644

Verma, Gaurav; Raskar, Siddhisanket; Xie, Zhen; Malik, Abid M; Emani, Murali; Chapman, Barbara (February 2023, ACM)

Full Text Available
FAIR for AI: An interdisciplinary and international community building perspective

https://doi.org/10.1038/s41597-023-02298-6

Huerta, E. A.; Blaiszik, Ben; Brinson, L. Catherine; Bouchard, Kristofer E.; Diaz, Daniel; Doglioni, Caterina; Duarte, Javier M.; Emani, Murali; Foster, Ian; Fox, Geoffrey; et al (December 2023, Scientific Data)

Full Text Available
XUnified: A Framework for Guiding Optimal Use of GPU Unified Memory

https://doi.org/10.1109/ACCESS.2022.3196008

Xu, Hailu; Lin, Pei-Hung; Emani, Murali; Hu, Liting; Liao, Chunhua (January 2022, IEEE Access)
MLPerf™ HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

https://doi.org/10.1109/MLHPC54614.2021.00009

Farrell, Steven; Emani, Murali; Balma, Jacob; Drescher, Lukas; Drozd, Aleksandr; Fink, Andreas; Fox, Geoffrey; Kanter, David; Kurth, Thorsten; Mattson, Peter; et al (November 2021, 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC))

Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf™ is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications, driven by the MLCommons™ Association. We present the results from the first submission round including a diverse set of some of the world’s largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization and communication scheduling enabling overall >10× (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system’s memory hierarchy and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch-sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O and network behaviour to parameterize extended roofline performance models in future rounds.
Full Text Available